In the class, we have learned simple join. Sometimes, for some of the rows you are joining, you may not be able to find a matching records in another table. In that case, you still want to keep the records from one table after JOIN. In this lab-problem set, we will learn how to do that.
The following figure shows different types of joins, and how to run the query for each type.
joins
library(DBI)
# create database: this will create a file in our hard drive
db <- dbConnect(RSQLite::SQLite(), "~/Data/my472/facebook-db-mini.sqlite")
# reading the first file
congress <- read.csv("~/Data/my472/congress-facebook-2017.csv",
stringsAsFactors=F)
set.seed(20181128)
congress <- congress[sample(nrow(congress), 100), ]
# adding first table: user-level data
dbWriteTable(db, "congress", congress)
# testing that it works with a simple query
dbListFields(db, "congress")
dbGetQuery(db, 'SELECT * FROM congress LIMIT 5')
fls <- list.files("~/Data/my472/posts", full.names=TRUE)
set.seed(20181128)
for (f in fls){
message(f)
# read file into memory
fb <- read.csv(f, stringsAsFactors=F)
fb <- fb[sample(nrow(fb), size = nrow(fb)/20), ]
# adding to table in SQL database
dbWriteTable(db, "posts", fb, append=TRUE)
}
# testing that it works
dbListFields(db, "posts")
dbGetQuery(db, 'SELECT * FROM posts LIMIT 5')
# what if we make a mistake and want to remove the table?
# dbRemoveTable(db, "posts")
# and we close the connection for now
dbDisconnect(db)
library(DBI)
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.6
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ──────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
db <- dbConnect(RSQLite::SQLite(), "~/Data/my472/facebook-db-mini.sqlite")
dbGetQuery(db,
"SELECT COUNT(*) FROM posts")
## COUNT(*)
## 1 8530
dbGetQuery(db,
"SELECT COUNT(*) FROM congress")
## COUNT(*)
## 1 100
dbGetQuery(db,
"SELECT COUNT(*) FROM posts AS p
JOIN congress AS c
ON c.screen_name = p.screen_name
")
## COUNT(*)
## 1 1655
dbGetQuery(db,
"SELECT COUNT(*) FROM posts AS p
LEFT JOIN congress AS c
ON c.screen_name = p.screen_name
")
## COUNT(*)
## 1 8530
msg <- try(
dbGetQuery(db,
"SELECT COUNT(*) FROM posts AS p
RIGHT JOIN congress AS c
ON c.screen_name = p.screen_name
"))
message(msg)
## Error in result_create(conn@ptr, statement) :
## RIGHT and FULL OUTER JOINs are not currently supported
IS NULL conditiondbGetQuery(db,
"SELECT COUNT(*) FROM posts AS p
LEFT JOIN congress AS c
ON c.screen_name = p.screen_name
WHERE c.bioid IS NULL
")
## COUNT(*)
## 1 6875
dbDisconnect(db)
In SQL, you can run a nested query where you use a result from the first query as a condition for the second query. Where we find a top 3 most frequent posters in the congress, and then get all messages from them.
#connect to the full db
db <- dbConnect(RSQLite::SQLite(), "~/Data/my472/facebook-db.sqlite")
dbGetQuery(db,
"SELECT screen_name, COUNT(*) AS post_count
FROM posts
GROUP BY screen_name
ORDER BY post_count DESC
LIMIT 3
")
## screen_name post_count
## 1 CongressmanNadler 1520
## 2 HurdOnTheHill 1500
## 3 RepMikeThompson 1494
dbGetQuery(db,
"SELECT screen_name
FROM posts
GROUP BY screen_name
ORDER BY COUNT(*) DESC
LIMIT 3
")
## screen_name
## 1 CongressmanNadler
## 2 HurdOnTheHill
## 3 RepMikeThompson
library(DT)
dbGetQuery(db,
"SELECT *
FROM posts
WHERE screen_name IN ('CongressmanNadler', 'HurdOnTheHill', 'RepMikeThompson')
") %>%
datatable()
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
dbGetQuery(db,
"SELECT *
FROM posts
WHERE screen_name IN (SELECT screen_name
FROM posts
GROUP BY screen_name
ORDER BY COUNT(*) DESC
LIMIT 3)
") %>%
datatable()
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html